CollateX and XML, Part 3

David J. Birnbaum (djbpitt@gmail.com, http://www.obdurodon.org), Last modified 2015-0y-07

This example collates ten full witnesses of Partonopeus de Blois (the files are available at the Oxford Text Archive; the quasi-TEI XML files are in the 2499/data/xml subdirectory of the zip file).

In Part 1 of this tutorial we collated just a single line from just four witnesses, spelling out the details step by step in a way that would not be used in a real project, but that made it easy to see how each step moves toward the final result. In Part 2 we employed three classes (WitnessSet, Line, Word) to make the code more extensible and adaptable. In Part 3 we enhance the processing by:

  1. processing the full text from all ten witnesses
  2. reading the input from files, instead of from strings within the Python code itself, and
  3. letting our Python script tell us which elements to flatten, so that we don’t have to identify them manually in advance.

The markup in the input files is similar in some respects to TEI, but the root element is <part>, obligatory TEI elements like <teiHeader> and <text> are not present, and the documents are in no namespace. Lines are tagged as <l>, and each line has @id and @n attributes. The value of the @n attribute refers to the order of the line within the individual witness, which is not relevant for collation. The @id attribute, on the other hand, represents the line number in a synopsis of all witnesses, which means that, for example, the <l id='34'> lines from all witnesses should be collated together, and similarly for other @id values. This makes it easy to identify the segments to be treated as separate collation sets; we can collate all versions of line #1 against one another, and then, separately, collate all versions of line #2 against one another, etc., ultimately concatenating the results. There are two peculiarities of the @id values that are relevant here:

  • Not every line occurs in every witness. This means that when we iterate over the @id numbers, we need to accommodate gaps in the data.
  • The @id values are not simply consecutive integers. Some values have appended letters, so that, for example, in witness G line 4008 is followed by 4008a and then 4009. This means that if we want to iterate over the @id values in order, we cannot rely on either purely numeric or purely string order.
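A quick illustration, with toy values, of why neither ordering works on its own:

```python
ids = ['9', '10', '4008', '4008a', '4009']
print(sorted(ids))  # string order puts '10' before '9'
try:
    sorted(ids, key=int)  # numeric order chokes on '4008a'
except ValueError as error:
    print(error)
```

The custom sort key defined below solves both problems at once.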

Additionally, in Part 1 and Part 2 of this tutorial:

  • We didn’t worry about the order of the witnesses in the output. Now that we are dealing with multiple segments, we probably want to ensure that the witnesses are rendered in the same order in all of the segments, which means that we have to sort them. For this tutorial the witness identifiers are all single upper-case Latin letters (A, B, C, F, G, L, P, T, V, W), and we’ll sort them in alphabetical order. (Alternatively, it is also possible to order them explicitly, perhaps in order to group them by hyparchetype.)
  • The witness siglum was attached to the <l> element. Now that we are dealing with full witnesses that contain multiple lines, we have to locate the siglum elsewhere.
  • The input "document" was a single <l> element, and we ignored the rest of the documents from which those single lines had been manually extracted. Now that we are dealing with complete quasi-TEI documents, we have to decide what to do with the rest of the content, that is, with the elements that are not lines.

In this tutorial we ignore all parts of the input documents except for the lines and the siglum. In real-life collation tasks with complete TEI documents, developers would probably want to incorporate at least some metadata from the <teiHeader> components of the sources.

Load libraries. In addition to the libraries used in Part 2, we also load the os library because we will be reading input from the file system and the itertools library to help concatenate lists efficiently.


In [11]:
from collatex import *
from lxml import etree
import json,re,os,itertools
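As a quick illustration of why itertools helps here (toy data, not the real witnesses): chain.from_iterable flattens a list of per-witness @id lists into a single sequence, which we can then deduplicate and sort.

```python
import itertools

# three toy witnesses, each contributing a list of line ids
id_lists = [['1', '2'], ['2', '3'], ['3', '4a']]
all_ids = list(itertools.chain.from_iterable(id_lists))
print(all_ids)               # ['1', '2', '2', '3', '3', '4a']
print(sorted(set(all_ids)))  # deduplicated: ['1', '2', '3', '4a']
```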

splitId(id)

We create our own sort-key function, in which we define linenoRegex with two capture groups, both of which match strings. The first captures all digits at the beginning of the line-number (@id) value; the second captures anything after the digits. The regex splits the input into a tuple of two strings, and we convert the first value to an integer before returning it. For example, the input value '4008a' returns (4008,'a'), where 4008 is an integer and 'a' is a string. We can then specify that our @id values should be sorted according to the results of processing them with this function. This overcomes our inability to sort them either numerically (because some of them contain letters) or alphabetically (because '10' would sort before '9' alphabetically).


In [12]:
def splitId(id):
    """Splits @id value like 4008a into parts, for sorting"""
    linenoRegex = re.compile(r'(\d+)(.*)')
    results = linenoRegex.match(id).groups()
    return (int(results[0]),results[1])
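Used as a sort key (the function is repeated here so the sketch is self-contained), splitId() puts mixed numeric-alphabetic @id values in the intended order:

```python
import re

def splitId(id):
    """Splits @id value like 4008a into parts, for sorting"""
    linenoRegex = re.compile(r'(\d+)(.*)')
    results = linenoRegex.match(id).groups()
    return (int(results[0]), results[1])

print(splitId('4008a'))  # (4008, 'a')
# '4008a' sorts between 4008 and 4009, and '9' before '10'
print(sorted(['10', '9', '4009', '4008a', '4008'], key=splitId))
```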

The WitnessSet class

The WitnessSet class represents all of the witnesses being collated.

all_ids()

Unlike in Parts 1 and 2, where each witness contained just one line (<l> element), the witnesses now contain multiple lines. We segment the witnesses by @id value, so that each segment (set of readings to be collated) consists of lines that share an @id value. To iterate over those values, we need to get a complete list of them, and to ensure that the output is in the correct order, we need to sort them. Lines will be processed individually, segmenting the collation task into subtasks that collate just one line at a time. The all_ids() method returns a list of line identifiers (@id values) from all witnesses in the correct order.

generate_json_input()

The generate_json_input() method returns a JSON object that is suitable for input into CollateX.
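For orientation, here is a hypothetical two-witness example of that input shape, matching what we will see in the real output below: a 'witnesses' list whose members each carry an 'id' (the siglum) and a 'tokens' list of objects with "t" and "n" properties.

```python
import json

# hypothetical minimal CollateX JSON input (sigla and tokens invented)
json_input = {'witnesses': [
    {'id': 'A', 'tokens': [{'t': 'De', 'n': 'de'}, {'t': 'mon', 'n': 'mon'}]},
    {'id': 'B', 'tokens': [{'t': 'de', 'n': 'de'}, {'t': 'mun', 'n': 'mun'}]}
]}
print(json.dumps(json_input))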


In [13]:
class WitnessSet:
    def __init__(self,witnessList):
        self.witnessList = witnessList
    def all_witnesses(self):
        """List of tuples consisting of siglum and contents"""
        return [Witness(witness) for witness in self.witnessList]
    def all_ids(self):
        """Sorted deduplicated list of all ids in corpus"""
        return sorted(set(itertools.chain.from_iterable([witness.XML().xpath('//l/@id') for witness in self.all_witnesses()])),key=splitId)
    def get_lines_by_id(self,id):
        """List of tuples of siglum plus <l> element from each witness that corresponds to a certain line"""
        witnesses_with_line = []
        for witness in self.all_witnesses():
            try:
                # quote the @id value in the XPath; ids like '4008a' are not numbers
                witnesses_with_line.append((witness.siglum,witness.XML().xpath('//l[@id = "' + id + '"]')[0]))
            except IndexError:
                # this witness lacks the line; skip it
                pass
        return witnesses_with_line
    def generate_json_input(self, lineId):
        """JSON input to CollateX for an <l> segment"""
        json_input = {}
        witnesses = []
        for witness in self.get_lines_by_id(lineId):
            currentWitness = {}
            currentWitness['id'] = witness[0]
            currentWitness['tokens'] = Line(witness[1]).tokens()
            witnesses.append(currentWitness)
        json_input['witnesses'] = witnesses
        return json_input
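A minimal sketch (toy XML, not a real witness) of the XPath lookup that get_lines_by_id() performs; quoting the value is what lets non-numeric @id values like 4008a match:

```python
from lxml import etree

# toy document with a numeric and a mixed @id
doc = etree.XML('<part><l id="4008">first</l><l id="4008a">second</l></part>')
print(doc.xpath('//l[@id = "4008a"]')[0].text)  # second
print(doc.xpath('//l[@id = "9999"]'))           # [] — a missing line yields an empty list
```

The empty list for a missing line is why get_lines_by_id() catches IndexError: indexing `[0]` into it raises, and the witness is simply skipped.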

The Witness class

Each witness in the witness set is an instance of class Witness. witness.siglum is a string and witness.contents is an XML tree.


In [14]:
class Witness:
    """Each witness in the witness set is an instance of class Witness"""
    def __init__(self,witness):
        self.witness = witness
        self.siglum = self.witness[0]
        self.contents = self.witness[1]
    def XML(self):
        return etree.XML(self.contents)
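The class above in action on invented toy data (a (siglum, contents) tuple, with the contents as bytes, as they will be when read from a file):

```python
from lxml import etree

class Witness:
    """Each witness in the witness set is an instance of class Witness"""
    def __init__(self, witness):
        self.witness = witness
        self.siglum = self.witness[0]
        self.contents = self.witness[1]
    def XML(self):
        return etree.XML(self.contents)

w = Witness(('A', b'<part><l id="1">Une ligne</l></part>'))
print(w.siglum)                  # A
print(w.XML().xpath('//l/@id'))  # ['1']
```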

The Line class

The Line class contains methods applied to individual lines. The XSLT stylesheets and the functions to use them have been moved into the Line class, since they apply to individual lines. The siglum for the line is retrieved from the witness that contains it, and is part of the Witness class. The line.tokens() method returns a list of JSON objects, one for each word token.


In [15]:
class Line:
    """An instance of Line is a line in a witness, expressed as an <l> element"""
    addWMilestones = etree.XML("""
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens
        -->
        <xsl:template match="add | sic | crease ">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="note"/>
        <xsl:template match="text()">
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(.,'&#x0a;',' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:when test="starts-with($input,' ')">
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring($input,2)"/>
                    </xsl:call-template>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input,' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    """)
    transformAddW = etree.XSLT(addWMilestones)
    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)
    def __init__(self,line):
        self.line = line
    def tokens(self):
        return [Word(token).createToken() for token in Line.transformWrapW(Line.transformAddW(self.line)).xpath('//w')]

The Word class

The Word class contains methods that apply to individual words. unwrap() and normalize() are private; they are used by createToken() to return a JSON object with the "t" and "n" properties for a word token.


In [16]:
class Word:
    unwrapRegex = re.compile('<w>(.*)</w>')
    stripTagsRegex = re.compile('<.*?>')
    def __init__(self,word):
        self.word = word
    def unwrap(self):
        return Word.unwrapRegex.match(etree.tostring(self.word,encoding='unicode')).group(1)
    def normalize(self):
        return Word.stripTagsRegex.sub('',self.unwrap().lower())
    def createToken(self):
        token = {}
        token['t'] = self.unwrap()
        token['n'] = self.normalize()
        return token
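Reusing the class (repeated here so the sketch is self-contained) on a toy <w> element shows the "t"/"n" split: the "t" property preserves the markup, while "n" lowercases and strips tags for normalized comparison.

```python
import re
from lxml import etree

class Word:
    unwrapRegex = re.compile('<w>(.*)</w>')
    stripTagsRegex = re.compile('<.*?>')
    def __init__(self, word):
        self.word = word
    def unwrap(self):
        return Word.unwrapRegex.match(etree.tostring(self.word, encoding='unicode')).group(1)
    def normalize(self):
        return Word.stripTagsRegex.sub('', self.unwrap().lower())
    def createToken(self):
        token = {}
        token['t'] = self.unwrap()
        token['n'] = self.normalize()
        return token

w = etree.XML('<w>g<abbrev>ra</abbrev>ce</w>')
print(Word(w).createToken())  # {'t': 'g<abbrev>ra</abbrev>ce', 'n': 'grace'}
```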

Create XML data and assign to a witnessSet variable

Our witnesses are XML files in the 'partonopeus' subdirectory of our current location. Verify that the files are there by listing them.


In [17]:
os.listdir('partonopeus')


Out[17]:
['A.xml',
 'B.xml',
 'C.xml',
 'F.xml',
 'G.xml',
 'L.xml',
 'P.xml',
 'T.xml',
 'V.xml',
 'X.xml']

Create a two-member tuple for each file, consisting of two strings: the one-letter identifier (the filename with the '.xml' extension removed) and the contents of the file. Assemble these into a list of tuples and use it to create an instance of the WitnessSet class, assigned to the variable witnessSet. Because we use the lxml library to parse the XML, files that contain Unicode data must be opened in raw (bytes) mode.


In [18]:
witnessSet = WitnessSet([(os.path.splitext(inputFile)[0],open('partonopeus/' + inputFile,'rb').read()) for inputFile in os.listdir('partonopeus')])

Generate sample JSON from one line of data (@id value '10') and examine it


In [19]:
json_input = witnessSet.generate_json_input('10')
print(json_input)


{'witnesses': [{'id': 'A', 'tokens': [{'t': 'De', 'n': 'de'}, {'t': 'mon', 'n': 'mon'}, {'t': 'segnor', 'n': 'segnor'}, {'t': 'la', 'n': 'la'}, {'t': 'gracie', 'n': 'gracie'}, {'t': 'issi', 'n': 'issi'}]}, {'id': 'B', 'tokens': [{'t': 'De', 'n': 'de'}, {'t': 'mo<abbrev>n</abbrev>segnor', 'n': 'monsegnor'}, {'t': 'la', 'n': 'la'}, {'t': 'g<abbrev>ra</abbrev>sce', 'n': 'grasce'}, {'t': 'issi', 'n': 'issi'}]}, {'id': 'G', 'tokens': [{'t': 'De', 'n': 'de'}, {'t': 'mo<abbrev>n</abbrev>seignor', 'n': 'monseignor'}, {'t': 'la', 'n': 'la'}, {'t': 'g<abbrev>ra</abbrev>ce', 'n': 'grace'}, {'t': 'eisi', 'n': 'eisi'}]}, {'id': 'L', 'tokens': [{'t': 'De', 'n': 'de'}, {'t': 'mon', 'n': 'mon'}, {'t': 'segnor', 'n': 'segnor'}, {'t': 'la', 'n': 'la'}, {'t': 'grace', 'n': 'grace'}, {'t': 'ensi', 'n': 'ensi'}]}, {'id': 'P', 'tokens': [{'t': 'Se', 'n': 'se'}, {'t': 'mo<abbrev>n</abbrev>seignour', 'n': 'monseignour'}, {'t': 'sa', 'n': 'sa'}, {'t': 'g<abbrev>ra</abbrev>ce', 'n': 'grace'}, {'t': 'einsi', 'n': 'einsi'}]}, {'id': 'V', 'tokens': [{'t': 'de', 'n': 'de'}, {'t': 'mun', 'n': 'mun'}, {'t': 'seignor', 'n': 'seignor'}, {'t': 'sa', 'n': 'sa'}, {'t': 'grace', 'n': 'grace'}, {'t': 'issi', 'n': 'issi'}]}]}

Collate and output the results of the sample as a plain-text alignment table, as JSON, and as colored HTML


In [20]:
collationText = collate(json_input,output='table')
print(collationText)
collationJSON = collate(json_input,output='json')
print(collationJSON)
collationHTML2 = collate(json_input,output='html2')


+---+--------------------------------+-----------------------------+----+-------------------------+-------+
| A | De                             | monsegnor                   | la | gracie                  | issi  |
| B | De                             | mo<abbrev>n</abbrev>segnor  | la | g<abbrev>ra</abbrev>sce | issi  |
| G | De                             | mo<abbrev>n</abbrev>seignor | la | g<abbrev>ra</abbrev>ce  | eisi  |
| L | De                             | monsegnor                   | la | grace                   | ensi  |
| P | Semo<abbrev>n</abbrev>seignour | -                           | sa | g<abbrev>ra</abbrev>ce  | einsi |
| V | de                             | munseignor                  | sa | grace                   | issi  |
+---+--------------------------------+-----------------------------+----+-------------------------+-------+
{"table": [[[{"n": "de", "t": "De"}], [{"n": "mon", "t": "mon"}, {"n": "segnor", "t": "segnor"}], [{"n": "la", "t": "la"}], [{"n": "gracie", "t": "gracie"}], [{"n": "issi", "t": "issi"}]], [[{"n": "de", "t": "De"}], [{"n": "monsegnor", "t": "mo<abbrev>n</abbrev>segnor"}], [{"n": "la", "t": "la"}], [{"n": "grasce", "t": "g<abbrev>ra</abbrev>sce"}], [{"n": "issi", "t": "issi"}]], [[{"n": "de", "t": "De"}], [{"n": "monseignor", "t": "mo<abbrev>n</abbrev>seignor"}], [{"n": "la", "t": "la"}], [{"n": "grace", "t": "g<abbrev>ra</abbrev>ce"}], [{"n": "eisi", "t": "eisi"}]], [[{"n": "de", "t": "De"}], [{"n": "mon", "t": "mon"}, {"n": "segnor", "t": "segnor"}], [{"n": "la", "t": "la"}], [{"n": "grace", "t": "grace"}], [{"n": "ensi", "t": "ensi"}]], [[{"n": "se", "t": "Se"}, {"n": "monseignour", "t": "mo<abbrev>n</abbrev>seignour"}], null, [{"n": "sa", "t": "sa"}], [{"n": "grace", "t": "g<abbrev>ra</abbrev>ce"}], [{"n": "einsi", "t": "einsi"}]], [[{"n": "de", "t": "de"}], [{"n": "mun", "t": "mun"}, {"n": "seignor", "t": "seignor"}], [{"n": "sa", "t": "sa"}], [{"n": "grace", "t": "grace"}], [{"n": "issi", "t": "issi"}]]], "witnesses": ["A", "B", "G", "L", "P", "V"]}
[The html2 output renders in the notebook as a colored HTML alignment table, presenting the same alignment as above with one column per witness (A, B, G, L, P, V).]
